Secure tumor classification by shallow neural network using homomorphic encryption

Background Disclosure of patients’ genetic information in the process of applying machine learning techniques for tumor classification hinders the privacy of personal information. Homomorphic Encryption (HE), which supports operations between encrypted data, can be used as one of the tools to perform such computation without information leakage, but it brings great challenges for directly applying general machine learning algorithms due to the limitations of operations supported by HE. In particular, non-polynomial activation functions, including softmax functions, are difficult to implement with HE and require a suitable approximation method to minimize the loss of accuracy. In the secure genome analysis competition called iDASH 2020, it is presented as a competition task that a multi-label tumor classification method that predicts the class of samples based on genetic information using HE. Methods We develop a secure multi-label tumor classification method using HE to ensure privacy during all the computations of the model inference process. Our solution is based on a 1-layer neural network with the softmax activation function model and uses the approximate HE scheme. We present an approximation method that enables softmax activation in the model using HE and a technique for efficiently encoding data to reduce computational costs. In addition, we propose a HE-friendly data filtering method to reduce the size of large-scale genetic data. Results We aim to analyze the dataset from The Cancer Genome Atlas (TCGA) dataset, which consists of 3,622 samples from 11 types of cancers, genetic features from 25,128 genes. Our preprocessing method reduces the number of genes to 4,096 or less and achieves a microAUC value of 0.9882 (85% accuracy) with a 1-layer shallow neural network. Using our model, we successfully compute the tumor classification inference steps on the encrypted test data in 3.75 minutes. As a result of exceptionally high microAUC values, our solution was awarded co-first place in iDASH 2020 Track 1: “Secure multi-label Tumor classification using Homomorphic Encryption”. Conclusions Our solution is the first result of implementing a neural network model with softmax activation using HE. Also, HE optimization methods presented in this work enable machine learning implementation using HE or other challenging HE applications.


Background
Cancer is a disease caused by the unlimited proliferation of certain cells in the human body, and the exact cause of cancer is not yet known. In 2020 alone, 19.3 million new cases were reported worldwide, and 10 million people died because of cancer [1]. For this reason, the prediction of cancer through genetic data analysis has been regarded as one of the most important tasks since early treatment can reduce the lethal effects of cancer on the human body.
Machine learning (ML) is one of the fields of artificial intelligence that learns the process of finding solutions on its own without human assistance to a given problem. Due to the difficulty in the process of diagnosing tumors through genes, tumor classification using ML-based on large amounts of genetic data contributes to decision making to diagnose and treat cancer. Some cancer genome studies using ML techniques on large-scale data have shown the relationship between genetic modification and specific cancer types [2][3][4][5].
Since genetic data contains a lot of personal information and cannot be discarded or changed even if it is leaked, it is essential to protect the privacy of information about genetic data in the data analysis using ML. Various techniques, including differential privacy [6,7] or multi-party computation [8], have been used to ensure privacy in the data analysis process. However, each of these approaches has the disadvantage of losing accuracy in the process of anonymizing the data or requiring multiple phases in the process of data sharing.
Homomorphic Encryption (HE) is a cryptographic scheme that enables us to perform arithmetic operations between encrypted data without decryption. HE has been considered one of the useful applications for privacy-preserving ML, as it allows the computation of desired operations without disclosing information about the data [9,10]. However, most HE libraries [11][12][13][14] mainly support only addition and multiplication of arithmetic operations. Although linear operation in ML, such as matrix-vector multiplication, can be easily computed by using a suitable data packing method [15], many activation functions widely used in neural networks, such as sigmoid, ReLU, or softmax functions, cannot be directly operated for encrypted data. Most of the applications using HE overcomes such problems by approximating non-arithmetic operations with a polynomial that minimizes the error in a certain interval. The main problem with applying HE is to minimize computational increases occurring during the polynomial approximation process while limiting errors.
Integrating Data for Analysis, Anonymization and SHaring (iDASH) has held an annual secure genome analysis competition since 2014. Each year, important topics in the field of genetic analysis are selected to compete for the most effective solution. The problem of multi-label tumor classification using HE was one of the three tasks for the 2020 iDASH competition. Given the dataset with a total of 2,713 patients and their 25,128 genes, participants had to preprocess the given data, train the ML model with plain data, and obtain the highest microAUC score [16] within 5 minutes when the inference step was performed with encrypted test data of 909 patients.

Related works
Integrated analysis, which means integrating and classifying different types of data for samples in the same cohort, was developed with the emergence of ML techniques. Now various ML techniques are used in multilevel omics data integration as reviewed in [17][18][19][20][21][22] and classified depending on the learning method, data integration method, and feature selection method: supervised and unsupervised learning; horizontal and vertical data integration; supervised and unsupervised feature selection.
In the field of cancer type classification based on somatic mutations, many ML techniques are used to build suitable multi-label classifiers. Classifiers using unsupervised learning method such as cluster analysis exists [23]; however, supervised learning-based classifiers are more in our interest because accuracy can be increased substantially with the labeled data. Chen et al. [24] used a supervised learning technique named Support Vector Machine (SVM) to classify cancer types of given somatic mutation samples.
Yuan et al. [4,25] proposed DeepGene and DeepCNA, which are multi-label cancer classifiers based on Deep Neural Networks (DNN) and Convolutional Neural Networks (CNN), respectively. In particular, DeepGene uses a feature selection technique called Clustered Gene Filtering (CGF) based on cluster analysis. Sun et al. [26] also used DNN model with 5 layers and reached 70.1% classification accuracy. Some classifiers use multiple ML techniques and ensemble them to reach higher accuracy. Lee et al. [27] introduced a classifier named CPEM, an ensemble of two ML techniques, Random Forest (RF) and DNN, and reached 84.7% accuracy. However, these previous classification techniques do not consider privacy protection and sensitive genetic data leaks to the untrusted classifier owner, unless the classifier owner gives his whole model to the client.
As a solution of privacy-preserving ML for genetic data, ML over encrypted data with HE is drawing attention with the annual iDASH competition [9,10,[28][29][30][31][32][33][34][35]. Kim et al. introduced a privacy-preserving logistic regression model over HE [28]. The model uses modified Nesterov's accelerated gradient descent method to reduce the number of iterations, since the depth of the homomorphic circuit highly affects to the computational cost. The classifier was selected as the best solution of Track 3 at iDASH competition 2017, where similar approaches were also proposed [9,29,30].
Some parallelized versions submitted for the second track of iDASH 2018 competition [10,[31][32][33][34]. In 2019, the HE task of iDASH competition was a secure genotype imputation using HE. The solution showed "ultra-fast" HE models to verify genotype imputation that took less than 10 seconds for evaluation with only a 2-3% decrease in accuracy [35]. The authors also have confirmed that similar results can be obtained for various HE libraries, including BFV [36], CKKS, and TFHE [37].

Difficulty of ML using HE
Some limitations of computation using HE make the general ML techniques not directly applicable to the ML over HE since they are yet insecure nor impractical for some reasons. First, as mentioned above, most HE libraries [11][12][13][14] mainly support only addition and multiplication of arithmetic operations. Thus, HE requires a polynomial approximation of non-polynomial functions such as ReLU, Sigmoid, softmax, and even division and comparison and those greatly amplify the amount of computation over HE. However, the supervised ML models [4,[25][26][27] generally have large circuit depths for high accuracy and use non-polynomial functions such as ReLU or max-pooling, which have high time complexity when implemented with HE.
To overcome the above problems, the ML over HE is being studied in various ways, such as replacing a nonpolynomial such as sigmoid in logistic regression with a simple polynomial [28], or finding an algorithm consisting of only polynomial operations although it is less efficient than state-of-the-art algorithms involving nonpolynomials [9]. In our case, we propose the efficient polynomial approximation of softmax function and our algorithm with HE that operates within a reasonable time.
Secondly, preprocessing methods included in the classifiers can leak personal information. To prevent leakage, preprocessing should be done by the client herself or the server in an encrypted state. However, clients with low computational capabilities cannot follow heavy preprocessing methods [24,27]. Also in the case of servers, they should compute only with the data encrypted with HE, so the preprocessing will be very impractical due to logical/non-polynomial functions in the preprocessing as in [4,[24][25][26][27] when using the CKKS [14] HE scheme. Therefore, the classifiers with HE-friendly models and light and secure preprocessing are important for privacypreserving classification.
In addition, if preprocessing method can be easily done by the server using above properties, there is another advantage to the client that does not have to compute preprocessing. In this case, the data can be used for various training models and so the client only needs to encrypts the original data and share them. Table 1 summarizes the previous multi-label tumor type classifications and their weaknesses when applied to the privacy-preserving classification scenario. HE-friendly Model indicates the applicability of the model over HE, and Client and Server (HE) indicate the hardness of secure preprocessing as mentioned above. Table 1 Multi-label classification of tumor type based on somatic mutation data. In the dataset, the numbers in parentheses are the number of tumor types used in classifications. The hardness of the client-side preprocessing is mainly due to high memory use, and the server-side preprocessing relies on the CKKS scheme  1 Illustration of our scenario for training and inference steps. In training step, our data filtering method reduces the size of raw data dramatically. Then, we use 10-fold cross validation to train shallow neural network (SNN) model using filtered data. As a result, we get the optimized parameters for both filtering and SNN model. In inference step, the protocol works as in the scenario using HE introduced in Background section.
Since we use filtering method in preprocessing, only some chosen genes are needed to be encrypted in ciphertext. Then, we run HE-friendly SNN algorithm to evaluate our model. As a result, the client can receive the evaluated value using their private data without revealing any information

Scenario and security model
The scenario in this paper follows the model presented in the HE task of iDASH 2020, which is also commonly considered in standard HE application models. In our scenario, two parties are involved in: the private data owner and the service provider, and they are simply denoted by client and server, respectively. The server trains the model using his data(or public large-scale data) and wants to provide a service to generate inference results, the prediction of the tumor type, from the client's encrypted data. The client has limited computational power and wants to outsource the tumor type prediction; however, she does not want to leak her sensitive data. In addition, the client's data needs to be preprocessed to fit the model computation while maintaining privacy in two possible ways: by the client with limited computation power or by the server in the encrypted state. Therefore, in our scenario, the server should not be aware of the client's data when preprocessing as well as computing the inference results. The detailed explanation of our scenario and method is illustrated in Fig. 1. Our protocol can be performed by using HE while ensuring client's data privacy. Let m be the client's private data and f be a function that computed the whole process of the model generated by the server. First, the client encrypts m by her own secret key and sends the ciphertext Enc(m) to server. Then, server evaluates his model using HE operations, so that he obtains the encrypted evaluation output Enc(f (m)) and sends it back to the client. Finally, the client performs HE decryption for Enc(f (m)) with the secret key to get the desired output f (m). Since the server can only access encrypted data, our protocol can achieve the desired security even on malicious servers.
In our implementation, all ciphertexts have a 128-bit security 1 level regardless of the remained number of operations. This means that the ciphertext will not leak any information in a lifetime even if all the computing power in the world is used. Our ciphertext is based on the Ring-Learning with Error (LWE) problem which is attracting attention as it seems secure although quantum computers are developed. And we set the HE parameters to achieve 128-bit security estimated by Albrecht's LWE estimator [38].

Goal of this work
Our goal is to perform privacy-protecting machine learning computations on encrypted tumor datasets. In general, there are two main obstacles to achieving this goal. The first point is the size of dataset. Most HE schemes that support multi-encoding are able to contain at most 2 16 or less number of data in a ciphertext with a practical parameter. Since typical genomic dataset consists of largescale matrix with millions of elements, a large number of ciphertexts are required to encrypt data and a huge computational overhead occurs in the operation. Therefore, a preprocessing process that reduces the size while maintaining the properties of the initial data is essential, and because of the security problem, this method should be performed in a HE-friendly manner.
The second point is that the machine learning model should be performed using HE. Since the operations allowed in HE is limited, especially addition and multiplication, most of deep and complex models cannot be implemented using HE in practical time now. Even in simple cases such as single-layer neural networks, since most activation functions use non-arithmetic operations such as comparison (ReLU) or exponential functions (sigmoid, softmax), additional time consuming and accuracy loss occurs in the process of computing approximation of these functions with polynomials.

Summary of results
In this paper, we propose a privacy-preserving multilabel classifier using a shallow neural network with a softmax activation function based on HE, which is also an outstanding solution of the first track of the iDASH 2020 competition. Our method is a supervised parallel integration method that uses unsupervised feature selection (cluster analysis) which provides the scalability on the parameter depending on the computational bound of HE. Our ideas can be categorized into three main subjects. First, we suggest data filtering method to reduce the size of the large raw data. This method is suitable for our scenario using HE because the data owner does not require additional operations other than filtering before encryption. Second, we modify the data packing method for matrix-vector multiplication [15] by duplicating the data in encryption to make a trade-off between the number of ciphertexts for packing and the number of rotations to reduce computational cost for HE. In multiplication between encrypted matrix and plain vector, the number of rotations occupies the most time cost, so we minimize the total time by choosing an appro-priate number of copies in the encryption step. Lastly, we use elementary exponential approximation method to evaluate the microAUC value. We minimize the loss of the approximated microAUC from the real value not by focusing on minimizing the error in the approximation of the exponential function, but by using an approximation that shares properties with the exponential function.

Dataset description
Our scenario focuses on the dataset originated from The Cancer Genome Atlas (TCGA) database, which is widely used in genomic research. The data of the somatic Single Nucleotide Variation (SNV) and gene-level Copy Number Variation (CNV) information for TCGA samples are downloaded from publicly available datasets [39] and [40], respectively. The training and testing data are generated based on the sample metadata downloaded from the TCGA project (available on 8/13/2020) and extracted the cases for all available cancer types. The samples first are filtered for the ones that exist in both SNV and CNV datasets. For each of the remaining cancer types, the data were randomly divided into training and testing datasets with 75% and 25% of all samples for the corresponding cancer type, respectively. Cancer types with less than 100 training samples are filtered out, and the resulting dataset consists of cancers from 11 sites: Bladder, Breast, Bronchus and Lung, Cervix uteri, Colon, Corpus uteri, Kidney, Liver and Intrahepatic bile ducts, Ovary, Skin, and Stomach.
The dataset consists of total 3,622 samples (2,713 for train, 909 for test) and 25,128 genes, and each sample has one cancer type out of 11 types of cancer, consists of two types of data: Copy Number (CN) data and Variants data. The dataset is available in [41] which is generated in the same way with iDASH 2020 competition track I and the details of dataset including the number of samples and the number of mutation's effects for each feature are disclosed in Table 2.
CN data consists of the copy number of the genes. The copy number data show the copy number variations (CNVs) of the genes as numbers, representing the state of the gene on the corresponding sample, whether the gene has duplication or deletion of a considerable number of base pairs. The data consist of 5 different levels 0, ±1, ±2 where the negative and positive values represent deletion and duplication of the corresponding gene from the sample, respectively. Copy number with 0 means that the gene has no considerable variation on the corresponding gene from the sample.
Variants data consists of mutation data of selected pairs of samples and genes for each tumor type with various features: gene's location on chromosomes, mutation type, whether the mutation is Single Nucleotide Polymorphism (SNP), and mutation's effects separately predicted by two different methods.

Neural network model and parameter selection
For the training step, we first downsize each sample by using our proposed filtering algorithms, and used it for the training of our neural network model. More precisely, we feed the downsized samples into our shallow neural network model, which consists of one hidden layer with 64 nodes and linear activation function and output layer with 11 nodes. During the training step, we used batch size of 32, number of epochs of 50 and dropout rate of 0.9.
To suitably use the trained model in downstream tasks over encrypted data, we varied (d cn , k var ), the parameters for CN and variants dataset respectively, until the size of downsized data are less or equal to 2 B for each B from 9 to 12. In our algorithm, the size of model is determined by the size of filtered data under two parameters. For each B, we seek the best pair of parameters, d cn among {0, 0.01, · · · , 0.24} and k var among {0, 10, · · · , 590}; more precisely, we use 10-fold cross validation to find the (d cn , k var ) pair with best microAUC such that the size of model is less or equal to 2 B . After selecting the parameters, d cn and k var , we train the model on the entire training dataset, and feed it into the inference step over encrypted data.
For the inference step over encrypted data, we adopt HEaaN library, the implementation of CKKS scheme [14]. The CKKS parameters are chosen by the ring dimension 2 17 and the ciphertext modulus 2 2670 . We used signed binary secret, which satisfy more than 128-bit security according to Albrecht's LWE estimator [38]. For the scal-ing factor of CKKS, we choose the scaling factor by 2 60 for ciphertexts and 2 40 to encode plain vectors for constant multiplication or masking vectors.
For the approximation of Softmax, we use Goldschmidt algorithm with M = 80 and d = 30 for the inversion, and for the exponential function, we use (r, L) = (4, 32) for B = 9, 10 and (4, 64) for B = 11, 12, respectively. All experiments were performed on Intel Xeon CPU E5-2620v4 at 2.10GHz processor and used 8 threads. The detailed description of CKKS parameters is stated in Methods section and we refer Algorithm 4 for the parameters used in our algorithms.
Our codes for the training step can be found in https:// github.com/jaihyunp/iDash2020, and those for the inference step can be found in docker repository swanhong/idash2020.

Training model with plain dataset
Gradually changing two parameters, our results are presented in Figs. 2 and 3. Figure 4 visualizes how (d cn , k var ) pair determines the size of model. The model tends to show the best performance in terms of microAUC on d cn ≈ 0.08, and larger k var tends to show a better performance. However, to optimize the computational cost on the privacy-preserving inference step based on homomorphic encryption, we seek the best parameters (d cn and k var ) for the model of size less or equal to 2 B for each B = 9, 10, 11, and 12.
The best (d cn , k var ) pair for each size of model, 2 B , and its performance on 10-fold cross validation is given in Table 3.

Fig. 2
Illustration of microAUC of each model during 10-fold cross validation with given pair of (d cn , k var ) . The model tends to show a good performance on d cn ≈ 0.08, and larger k var shows a better microAUC near d cn = 0.08. As we visualize in the graph with contours, larger d cn and k var accompanies a larger size of the model

Encrypted inference results
Using filtered gene in preprocessing step and the model built in training step, we compute inference step over encrypted data as stated in previous sections. As a result, we estimate the time cost for each round-trip step including encryption, constant multiplication, rotate and sum-mation, Approximate softmax evaluation, and decryption step. The results are stated in Table 4. In the table, the column Encoding Duplication means the parameter m that we used as the number of duplication in encryption. As m increases, the number of required ciphertexts increases linearly so the time cost for encryption also   Fig. 5 for B = 10. Moreover, we state the final microAUc value and accuracy for each B in the table. In summary, we get about 0.988 microAUC and 85% accuracy for test dataset, which is the relatively better result compared to other works using the dataset from same data source [4,26].

Discussion
Our work in this paper was awarded fist place (with other teams including Desilo, Inpher and SamsungSDS) in the HE track of iDASH 2020 competition. Unlike other award winners who obtained relatively low microAUC values (close to 0.95) by performing linear activation within a very short time, our result shows a high microAUC value by proposing the only method for applying softmax activation within a practical time. In addition, our preprocessing method has scalability, making it possible to obtain a higher microAUC value through additional time consumption, which is different from other teams.
The main purpose of our data preprocessing is to remove irrelevant genes in the data. The genes removed have a similar effect to the remaining genes, or are have a weak effect on the cancer type classification. CN data filtering extracts the genes with the hamming distance of the samples' copy number, based on the fact that the adjacent genes have copy number similarity. This filtering technique can be applied to other types of the genetic dataset with an appropriate threshold d cn for each dataset. Variants data filtering can be also applied to other types of Table 3 The performance of correctness with the best pair of (d cn , k var ) for each threshold of the size of the model among training dataset. The best pairs are chosen from the result in Fig. 2 that shows best microAUC score in the same thresholds. We note that 10-fold cross validation is used on the training set   [4,26] would be able to show better accuracy using the same dataset. To our best knowledge, our softmax approximation algorithm is the first approach to use the softmax activation function for the neural network implemented with HE. However, the size of the input data must be adjusted through normalization, and it should be preceded that the process of predicting the range of the maximum and minimum values of the matrix product XW as a result of the train data to compute the approximation of exponential function and the Goldschmidt algorithm. In addition, our algorithm may not be suitable for computing deep neural network models with multiple softmax activation functions for a limited time, as it consumes a lot of HE depth in approximating the softmax. In further research, we expect that other popular activation functions, such as sigmoid or ReLU, can be combined with our neural network model to improve the final score.

Conclusions
In this paper, we propose the first result of privacypreserving multi-label classification for tumor data using neural network with softmax activation. To enable implementation using HE within a practical time, we suggest filtering method to reduce the size of genes from 25,128 to pre-fixed bound 2 B where B is in {9, 10, 11, 12}. For encoding filtered dataset to ciphertexts, we suggest the duplicating method that encodes same data multiple time, which decreases the time cost for matrix-vector multiplication for evaluating neural network model and provides a time cost trade-off between encryption step and message rotation step. Also, our approximation method for the softmax function enables to apply the softmax activation function for shallow neural network model. As a result, we obtain inference results with microAUC values of about 0.988 to classify multi-label tumor data in 5 minutes. If the time given for the inference step is not limited, our preprocessing and approximation methods can be used as building blocks for general deep neural networks.

Notations
All logarithms are base 2 unless otherwise indicated. The vectors are denoted with upper arrow. We denote an entry of the vector by using the same character with index. We denote Hadamard multipcation between two vectors by a b. Also, for simplicity, we denote the [ x] n be the number in {1, · · · , n} satisfying [ x] n = x mod n.
For a matrix A, A(i, j) means the i th row and j th column element of A. Also, we denote the submatrix of A with i ∈ I th rows and j ∈ J th columns from A by A[ I, J]. In this case, the colon(:) indicates the whole index set.

Approximate homomorphic encryption
Since HE enables anaylsis of encrypted data while preserving the privacy of message from operators, it is considered as one of the beneficial tool to delegate operations that requires sensitive data without revealing any information. Unlike other popular cryptographic tools including multiparty computation, which require protocol participants to continuously interact to each other through the scenario, HE has the advantage that no additional actions or online processes are required for the message owners after they encrypts and transmits data to the operator. Through these characteristics, HE has been exploited in various field such as ML that requires computation using sensitive information such as genomic data [35,42,43] or financial data [44]. Additionally, HE can also play an important role in the protection of data in the computation process in applications that require computations between real data, such as ML [45] or cyber physics system [46].
Since Gentry firstly suggested in his blueprint [47] in 2009, a number of HE schemes have been proposed to achieve useful properties for applications. Each scheme has advantages in operations in a particular message space, such as finite field operations [12,36] or boolean circuit [37]. However, many well-known deep learning algorithms [48,49] cannot be directly implemented because of some limitations of HE. First, since most HE methods only support multiplication operations less than fixed number of depths, it is difficult to implement a method using a large number of layers in a neural network. Therefore, most HE applications focus on implementing shallow neural networks or logistic regression. Although it is possible to recover the depth of the ciphertext through an operation called bootstrapping, it requires a very high computational overhead compared to other basic operations. Second, each scheme does not support both common arithmetic operations and binary(or logistic) operations at the same time. While many ML algorithms require both operations such as matrix-vector multiplication or ReLU function, the application of the HE scheme requires the way to efficiently perform unfavorable operations.
The approximate HE scheme, namely as CKKS scheme, is proposed by Cheon et al. [14]. The main feature of approximate HE scheme is that it deals with operations in complex numbers C and it supports fixed-point arithmetic operations between encrypted data. It also supports approximate arithmetic, which considers the noise of the ciphertext as part of the message to increase the efficiency of the operation. Since most ML algorithms mainly use fixed-point operations on real data or noise-friendly algorithms such as gradient descent, CKKS scheme takes advantages of the most of ML applications compared to other HE schemes.
We remark that the word approximate does not mean that the homomorphic operations contain large errors and result in data loss, but rather that the very small errors are allowed in the message to increases efficiency of the operations in the scheme. In machine learning applications, such errors does not ruin the message as the scheme guarantees a sufficiently large precision if practical parameters are used.
For the rest of this section, we formally describe CKKS scheme. Let L be a level parameter that a fresh ciphertext is equipped with a modulus q L = (2 ) L , and q := (2 ) for 1 ≤ ≤ L for some scaling factor . Let R := Z[ X] /(X N + 1) be a cyclotomic ring for a powerof-two N and R q be a modulo-q quotient ring of R, i.e., R q = R/qR. The distribution χ enc and χ err denote the discrete Gaussian distribution with some fixed standard deviation. The distribution χ key outputs a polynomial of {−1, 0, 1}-coefficient. We denote the rounding function · and modulo-q operation [ ·] q . CKKS scheme uses a plaintext vector m ∈ C N/2 and provides enrty-wise operation, called Single-Instrument-Multiple-Data (SIMD) operation, such as addition, substitution, and Hadamard multiplication between vectors. To encrypt complex value, CKKS uses a field isomorphism τ : R[ X] /(X N + 1) → C N/2 called canonical embedding.
-Sample s ← χ key and Set the secret key as sk = (1, s). -Sample a ← U(R q L ) and e ← χ err . Set the public key as pk = (b, a) ∈ R 2 q L where b =[ −a · s + e] q L .
For the remind of the paper, we may denote the operations between ciphertexts or ciphertext and plain vector by common symbols, such as Add(ct 1 , ct 2 ) = ct 1 + ct 2 or CMult(ct, c) = ct · c for simplicity.

Data preprocessing
Given more than 25,000 genes, meaningful gene selection is essential for efficient ML using HE. Therefore, we introduce a new data preprocessing technique consisting of a feature selection method modified from [4] to obtain significant genes useful for tumor prediction. Our data preprocessing method consists of two filtering algorithms that extract specific features from two different data types. The resulting filtered data matrix is concatenated as shown in Fig. 6(c) and inserted into our neural network. In the rest of paper, we denote the number of samples and genes from original data by s and G and the number of types of tumor by T. In our dataset, these notations contain values s = 2,713(train) or 909(test), G = 25,128 and T = 11. Using the abstract notation, the CN and Variants data can be understood as a matrix of the size s × G and the T number of set of information. Since the number of genes G is too large to handle with HE efficiently, we propose the preprocessing method which is easy to compute both client-side and server-side in our scenario to reduce the number of filtered gene less than fixed bound.

CN data filtering
Our main approach for CN data filtering is to filter out the irrelevant genes with cluster analysis based on the similarity of the neighboring genes. Modified from CGF in [4], we introduce a new cluster analysis using hamming distance instead of Jaccard distance. To reduce the number of input features we take only one gene from each cluster, unlike CGF.
Let C ∈ {0, ±1, ±2} s×G be the matrix of copy number data, where the s rows correspond to the s samples, and the g columns correspond to G genes; C(i, j) is a copy number that corresponds to gene j of sample i (see Fig. 6(a)). We initially sort the raw CN data according to the order of the gene positions on the chromosome. Then, we cluster the genes into groups by their copy number similarity (line 3-10 in Algorithm 1). For two gene columns p, q ∈ {0, ±1, ±2} 1×G , we use the Hamming distance as the similarity measure dist H ( p, q) = the number of i such that p i = q i (1) where p i and q i are the i th entry of vectors p and q, which represents the copy numbers of each gene for ith sample. Starting from the first gene in C, say a representative of the first group, we calculate the hamming distance with the following genes in order. Until the distance is less than the predetermined threshold d cn , we merge each corresponding gene to the first group. If the distance first reached the threshold d cn , the first group ends and the corresponding gene becomes a representative of the second group. Then start from the new representative, calculate the hamming distance with the following genes. Repeating this algorithm until the genes are all clustered, each gene is in a unique group with neighboring genes. Finally, we filter the CN data with the representative genes, the first gene in each group, resulting in a matrixC ∈ {0, ±1, ±2} s×g cn whereg cn is the number of groups as the result of our filtering. Note thatg cn is much smaller than the number of original genes. We formally state our CN data filtering algorithm in Algorithm 1 and shown in Fig. 6(a).

Variants data filtering
For Variants data filtering, the feature selection mechanism uses only a mutation's effect data classified in 4 different levels: LOW, MODERATE, MODIFIER, and HIGH. The Variants data filtering works based on the effectiveness of the mutations. It filters out the genes with less effective mutations.
In the Variants data, for each sample, only a few mutation effect data of individually selected genes are given. So the raw Variants data for whole samples and genes is almost empty, which cannot be used immediately as a neural network input. Hence our Variants data filtering is separately applied to each cancer type to reduce the empty spaces and fill in yet remaining spaces with zero.
Let V t be the matrix of the raw data with mutation's effect corresponding to the tumor type t (1 ≤ t ≤ T). V t is a s t × G matrix with mutation's effect as strings, where s t rows correspond to s t samples and G columns correspond to whole G genes for each t. V t (i, j) is the effect of the mutation in j th gene of i th sample with t (see Fig. 6(b)) . We first encode the string data to predetermined real values between 0 and 1: LOW = 0.2, MODERATE = 0.5, MODIFIER = 0.9, HIGH = 1.0, and 0 if no information exists. The resulting encoded Variants matrix is sparse matrix in E s t ×G , where E = {0, 0.2, 0.5, 0.9, 1.0} be a set of encoding values.
Secondly, we sum V t by each gene column and if the sum is more than the predetermined threshold k var , then put the gene in the list of the selected genes (line 2-8 in Algorithm 2). The column-wise summation s t i=1 V t (i, j) is a mutation's effect of jth gene to the s t samples, so the selected genes in list can be regarded as genes with considerable mutation effects. The union of the selected genes from each tumor type t is the set of filtered genes.
Finally, we filter the whole Variants data with the filtered genes for each t, resulting in a matrixṼ ∈ E s×g var with much smaller genes to whole samples The workflow of the Variants data filtering technique is shown in Fig. 6(b).

Algorithm 2: Variants data filtering
Input : encoded Variants data matrices V t ∈ E s t ×G for cancer types 1 ≤ t ≤ T, and threshold

Roadmap for training step over training data
After preprocessing for CN and variants data, we get the data matrix X =[C|Ṽ ] with size s × g where g = g cn +g var by concatenating two filtered results. Then, with the tumor data Y, we train a neural network model from (X, Y ) in plain while transform Y as the one-hot encoded label matrix with size s × T. Here, we note that T is 11 in our dataset, which is relatively small than s and g. Our neural network model consists of one hidden layer with 64 nodes and linear activation function and output layer with 11 nodes. In the output layer, we use softmax activation function to output their predicted value. During the training phase, we used Keras library [50] in Python with batch size of 32, number of epochs of 50 and dropout rate of 0.9. Note that we used sufficiently big dropout rate in order to avoid overfitting since the layers in our model are small.
To select the best parameters, d cn and k var , for the preprocessing on the given train dataset, we train and evaluate shallow neural network models using 10-fold cross validation with filtered data based on each pair of parameters. After selecting the best parameters, we train the shallow neural network model on the entire train dataset using the best parameters, and we use the trained model for the inference step over encrypted test dataset.

Roadmap for inference step over encrypted data
From now on, we state the method to compute inference step from our model with encrypted data. Precisely, input data matrix X with size s × g and weight matrix W with size g × T are given (recall that s, g and T refers the number of samples, genes, and tumors, respectively). Since T is relatively small than s and g in practice, we consider the matrix-matrix multiplication Y = X · W as matrix-vector multiplications y i = X · w i for 0 ≤ i < T, where w i is the (i + 1)th column of W. Then, for each row Y j of Y (0 ≤ j < s), we compute the softmax function to get final score of our model. Since the matrix Y is still encrypted, we need to compute an approximate function of softmax. In short, our Method can be divided into three steps: 1. Data Packing : encrypt the matrix X to a number of ciphertexts {ct i }. 2. Matrix-Vector Multiplication : using {ct i } and W, compute matrix-vector multiplication to get the ciphertext ct Y that contains Y = X · W . 3. Softmax Evaluation : compute approximate softmax function for ct Y to obtain softmax output for each row of Y.
We break the method down into the first two steps and the other, and explain the main ideas in each section.

Data packing and matrix multiplications
Although CKKS scheme supports SIMD operation between vectors, the SIMD operation cannot be directly applied to the matrix-vector multiplication operation. Therefore, in order to efficiently perform a matrix-vector multiplication operation using HE, a process of mapping a matrix to a number of vectors is required. In this section, focusing on the matrix multiplications in our scenario, we first state the naive approach and then suggests our optimization method. The main idea of mapping a matrix to vectors for this computation is suggested in [15], which is that X · w i can be understood as n times of slot-wise vector multiplication between diagonal part of X and w i . Concretely, we define x j by the j-th diagonal part of X so that x j = (X(k, [ j + k − 1] g ) 1≤k≤s for 0 ≤ j < g. For the repeated rotation of each column w i of W, we define with superscript with parentheses as w Then, the matrix-vector multiplication can be computed as i . Hence, the multiplication can be done if we encrypt x j 's in several ciphertexts and compute constant multiplication between ciphertext and plain vector w (j) i (see Fig. 7(a)).

Warm-up with toy example
Before we state our explicit method for encodings and matrix-multiplications, we start with a toy example, simplifying the parameters as s = 8, g = 4, and T = 4, and assumsing that the number of slots in one ciphertext is 4 times larger than the size of vector s. Then our goal Hence, the final sum is obtained by computing rotation twice as ct i = ct mult i + Rot(ct mult i ; 2s) and ct sum i = ct i + Rot(ct i ; s), where the real parts of first s slots in ct sum i contains y i (see Fig. 7(b)). From the naive approach, we need T number of the matrix-vector multiplication, so that the CMult and Rot should be computed T and 2T times, respectively. Here, the implementation cost of Rot is larger than CMult, so we suggest new packing method to reduce the number of rotation in the whole algorithm. Our main idea is to duplicate and encode X multiple times in each ciphertexts, which reduces the number of rotations since the number of summation between the slots in one ciphertexts is reduced. Now, we propose a new approach for matrix-vector multiplication to reduce the number of constant multiplications and rotations. Our main idea is that we copy each x j 's several times in encoding step. For the same condition as above example, we can encode the vectors x i twice so that we obtain two ciphertexts ct 0 = Enc( x 0 x 0 x 1 x 1 ) and ct 1 = Enc( x 2 x 2 x 3 x 3 ). In this case, we definew k,l as in Eq. (2).
for 0 ≤ k, l < 2. Then we get ct mult . Thus, by rotating and adding those ciphertexts as ct sum k = 1 l=0 ct mult k,l and ct Y ,k = ct sum k + Rot(ct sum k ; 2s), the result ciphertexts contain message vectors as in Eqs. (3) and (4), (4) which are y i 's that we desired. In this method, the operations we need are 4 CMults and 2 Rots, while 2 ciphertexts are required for encryption instead of 1 (see Fig. 7(c)).

Using imaginary part of message space
The CKKS scheme supports operations between complex numbers, but in applications using real numbers only such as neural networks, the imaginary part is never used. Here, we can make the computation in the encrypted state more efficient by using the imaginary part, which is not used in the plain operation.
In [35], the authors suggested the method to reduce the number of multiplications using CKKS scheme. For four real numbers a, b, c, and d, the sum of the two products ab + cd is equal to the real part of one complex product (a + ib)(c − id), where i = √ −1. Since the matrix-vector multiplication we need is also composed of the sum of Hadamard multiplications between vectors, we can reduce the number of ciphertext and the number of constant multiplications by half by combining each vector by two and encoding it into one complex number.

Rotate and sum algorithm
For the matrix-vector multiplication, the addition between data packed in one ciphertext is needed. We use RotSum algorithm, which repeatedly computes rotation and addition to get summation of desired slots in a ciphertext.
Precisely, we define the algorithm RotSum(ct, s, t, d) outputs the ciphertext ct out that is obtained by computing ct out ← ct and ct out ← ct out + Rot rk (ct out ; st j ) for j = 0, 1, · · · , d − 1 in order. As a result, if ct was an encryption of the message vector m = (m i ) 1≤i≤n , then ct out contains the message vector m out = .

Putting it all together
Combining all of the above methods, we explain the explicit method used for our implementation. For the rest of this section, we denote the concatenation of vectors by t i=s a i = ( a s a s+1 · · · a t ) and the repeated concatenation of the same vector by ( a) r = ( a a · · · a) (r times). Also, we can embed matrices X and W into large matrices with row and column lengths of power-of-2, respectively, so we assume that s, g are power-of-2's. Recall that our goal is to compute y j = g i=1 x i w (i) j for 0 ≤ j < t while x i 's are encrypted, so that the output y j 's are also encrypted.
As explained before, we copy each x i 's m times when encoding, where m is a power-of-2. Since each vector has the size s and the number of slots in a ciphertext is n, each ciphertext will contain = 2 · n ms number of different vectors (in both real and imaginary parts). Thus we need total g = m · sg 2n number of ciphertexts for encoding.
Precisely, we encode x i 's to each ciphertext ct i and plain vectorw ji by Eqs. (5) and (6) for 0 ≤ i ≤ g − 1 and 0 ≤ j ≤ T m − 1, respectively. Note that the massage vector in ct i sequentially contains m copies of vector x i· +2q + i · x i· +2q+1 for each q. Here we define w jm+r = 0 if jm + r ≥ T.
Next, we compute the multiplication step. For each j, we multiply ct i andw ji , add its conjugation and divide it by 2 to get ct ji = (ct i ·w ji ) + (ct i ·w ji ) /2. Then, by adding them to get ct sum j = g −1 i=0 ct ji , we get the ciphertext ct sum j which is the encryption of the vector in Eq. (7) for each j.
Finally, we compute the following summations using RotSum so that the first ms slots contain the desired sums and are copied to the remaining slots. The Eq. (8) indicates the result of vector operation from Eq. (7) since the summation for (i · + wq + k)'s for all indices q, i, and k works as the summation for all indices from 0 to q − 1. Hence, the result is the desired vector y jm+r .
Then for each j, we compute the result ct Y ,j that contains jth group of m outputs { y jm+r } 0≤r<m using RotSum operation. From Eq. (8), the resulting ciphertext ct Y ,j can be computed as Eq. (9).
As a result, the desired vectors y i = X · w i are contained in T m number of ciphertexts {ct Y ,j } 0≤j< T m . Note that the number of rotations used in the Eq. (9) is log 2 = log n ms for each j, so total log n ms · T m . In summary, if we duplicate x j 's m times for encryption, then the required number of rotations is reduced by m times, where the number of ciphertexts for encryption in increased by m times. Exactly, the number of rotations and multiplications are log n ms · T m and sg 2n · m T m , respectively, while the number of ciphertexts for encryption is increased to m · sg 2n (note that the value sg 2n implies the minimum number of ciphertexts to encrypt X). Since the number of multiplication does not change asymptotically for m, our method provides a time cost trade-off between the number of rotations and encryptions. In our implementation environment, we can reduce the total time cost by using more memory to encrypt data since the rotation cost is the significant part in total matrix-vector multiplication algorithm.

Approximation of softmax
While softmax layer substantially enhances the performance of the classification, it cannot be directly computed by CKKS scheme since it comprises several nonpolynomial operations. Recall that the softmax of a real vector v = (v 1 , · · · , v t ) is defined as Eq. (10).
, · · · , exp(v t )) (10) Both exponential function and division are not polynomial, so we should replace them by their polynomial approximation. We introduce a proper polynomial approximation technique for each of exponential and division function. In general, in order to approximate a nonpolynomial operation by a polynomial, minimax method that minimizes the maximum error value within an interval or Chebyshev approximation that express the function by the series of Chebyshev polynomials are mainly used. However, the minimax approximation loses the increasing property of exponential and the division function, because the sign of the error is not constant. Hence, this method is not suitable in terms of calculating microAUC score. Instead, we suggest an approximation method of the softmax layer that is more suitable for microAUC score. In particular, our approximation method consists of a less number of squaring operations, so it has an advantage to be evaluated over homomorphically encrypted data.

Approximation of exponential function
Our approximation method for exponential function comes from the elementary definition of exp(x). Precisely, for some r, we approximately compute exp(x) as Eq. (11).
Note that this formula can be computed by squaring n times, so it mitigates the computational overhead accompanied by the HE computation. Also, our approximation of exp(x) is monotone on [ −2 r , ∞), so it is more appropriate to approximate softmax functions in terms of getting a high microAUC score compared to other polynomial approximation techniques such as minimax and Chebyshev approximation. Furthermore, to make the evaluation more stable, we rather consider the approximation of scaled exponential function, AE r,L (x) as Eq. (12). From the definition, we expect this funciton to satisfy AE r,L (x) ≈ L 2 r −2 r exp(x). If we carefully choose L large enough that satisfies | 2 r +x L | < 1 for the all possible choices of x, then the values will not rapidly grow during the evaluation. This makes the computation more stable to the HE implementation which usually supports fixed precision bits.

Goldschmidt's algorithm
Goldschmidt's divison algorithm [51] is one of the most popular approximation algorithm to compute the inverse of a real number. For x ∈ (0, 2), Goldschmidt's algorithm uses the property as in Eq. (13).
Note that the approximation converges rapidly as d grows, and it uses 2d − 2 multiplications to evaluate the approximate polynomial of degree 2 d − 1. The small number of multiplication gives a great advantage on implementation with HE, because HE multiplication between ciphertext is hugely time consuming. The detailed algorithm is described in Algorithm 3. Here, note that the Algorithm 3 quickly converges to 1/x since the ratio between the error and true value is a d −1/x 1/x = (1 − x) 2 d+1 .

Algorithm 3: Goldschmidt Method [51]
Input : x ∈ (0, 2), number of iteration d ∈ N Output: an approximate value of 1/x 1 a 0 ← 2 − x 2 b 0 ← 1 − x 3 for n ← 0 to d − 1 do 4 b n+1 ← b 2 n 5 a n+1 ← a n · (1 + b n+1 ) 6 end 7 return a d Moreover, we can easily exploit Goldschmidt's algorithm Gol(·) to approximately evaluate 1/x on (0, 2M), instead of (0, 2), by using the equation 1 x ≈ 1 M Gol x M . We note that the scaling factor M should be carefully chosen since the error accompanied by Goldschmidt's algorithm becomes non-negligiblly large as the input is near 0. Therefore, if we select an overly large M, the input values become smaller, resulting in an error that cannot be ignored.
To put approximate exponential function and Goldschmidt's algorithm together, we now can approximately compute the softmax layer by using HE operations. Denoting L r = L 2 r 2 r , we utilize the Eq. (14) to compute approximate softmax function using HE.

Algorithm 4: Approximate Softmax Algorithm
Input : iteration number d ∈ N, the pre-determined bounds r ∈ N and L, M ∈ R, a vector v = (v 1 , · · · , v t ) ∈ (−2 r , 2 r ) t Output: an approximate output of softmax( v) 1 for i ← 1 to t do w i ← inv · w i 10 end 11 return (w 1 , · · · , w t ) The detailed algorithm is described in Algorithm 4.